The belief propagation (BP) based algorithm is investigated as a potential decoder for both error correcting codes and lossy compression, which are based on non-monotonic tree-like multilayer perceptron encoders. We discuss whether the BP can give practical algorithms in these schemes. The BP implementations in this kind of fully connected network unfortunately show strong limitations, while the theoretical results seem somewhat promising. Instead, the BP-based algorithms reveal that the solution space might have a rich and complex structure.

1. Introduction

In today's society, information processing is part of our everyday life. As the pool of data available to us grows exponentially over the years, it is vital to be able to store, recover, and transmit those data in an efficient way. With the birth of information theory following the pioneering work of Shannon, methods to process information efficiently began to be widely studied.

It has been shown that it is possible to ensure error-free transmission using a nonzero code rate up to a maximum value, which cannot be exceeded without an inevitable loss of information. This upper bound is known as the Shannon bound. The design of efficient and practical codes is still one of the main topics of information theory. For example, Sourlas's code^^ asymptotically attains the Shannon bound for channels with very small capacity. An interesting feature of Sourlas's paper is that it showed the possibility of using methods from statistical physics to investigate error correcting code schemes. Following this paper, the tools of statistical mechanics have been successfully applied to a wide range of problems in information theory in recent years, for instance in the field of error correcting codes itself,^^^ as well as spreading codes.^^^^

On the other hand, lossy compression, the counterpart of lossless compression (which seeks error-free compression), has also been discussed. Its task is to compress a given message while allowing a certain amount of distortion between the original message and the message reconstructed after compression. An efficient lossy compression scheme should be able to keep the compression rate as large as possible while keeping the distortion as small as possible. This is a typical trade-off optimization problem between the desired fidelity criterion and the compression rate. As in the reference, Shannon derived an upper bound which gives the optimal achievable compression rate for a fixed distortion, i.e., a fixed fidelity criterion. Recently, statistical mechanical techniques were applied to this kind of problem with interesting results.^^ ^^^

This paper focuses on error correcting codes and lossy compression where non-monotonic tree-like committee machines or parity machines are used as encoder and decoder, respectively (for a thorough review of this kind of neural network, see the reference^^^). It has been analytically shown that in both the error correcting code and lossy compression cases, schemes of this kind can reach the Shannon bound under some specific conditions.^' While these results are interesting from a theoretical point of view, the complexity of a formal encoder/decoder prevents these schemes from being practical. A formal way of encoding/decoding information would require an amount of time which grows exponentially with the size of the original message. One possible solution is to use the popular belief propagation (BP) algorithm in order to approximate the marginalized posterior probabilities of the appropriate Boltzmann factor which describes the behavior of the scheme.

The BP algorithm is proved to be exact and is guaranteed to converge only for probability distributions which can be represented by a factor graph with no loops, i.e., a tree. This is not the case for schemes based on the above kind of neural networks: they are densely connected and necessarily contain loops, i.e., their corresponding factor graph is not a tree.
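The exactness of BP on a loop-free graph can be checked numerically on a toy model. The following is a minimal sketch, not taken from the paper: it assumes a three-spin Ising chain with hypothetical local fields `h` and couplings `J`, runs sum-product messages along the chain, and compares the resulting marginals with brute-force enumeration.

```python
import itertools
import numpy as np

states = np.array([-1, 1])
h = np.array([0.3, -0.2, 0.5])   # hypothetical local fields
J = np.array([0.8, -0.4])        # hypothetical couplings on edges (0,1) and (1,2)
n = len(h)

def phi(i):
    """Local factor exp(h_i * s_i) evaluated on both spin states."""
    return np.exp(h[i] * states)

def psi(e):
    """Pair factor exp(J_e * s * s') as a 2x2 table."""
    return np.exp(J[e] * np.outer(states, states))

# Sum-product messages arriving at node i from the left (fwd) and right (bwd).
fwd = [np.ones(2) for _ in range(n)]
bwd = [np.ones(2) for _ in range(n)]
for i in range(1, n):
    fwd[i] = psi(i - 1).T @ (phi(i - 1) * fwd[i - 1])
for i in range(n - 2, -1, -1):
    bwd[i] = psi(i) @ (phi(i + 1) * bwd[i + 1])

bp = np.array([phi(i) * fwd[i] * bwd[i] for i in range(n)])
bp /= bp.sum(axis=1, keepdims=True)

# Brute-force marginals over all 2^n configurations.
exact = np.zeros((n, 2))
for conf in itertools.product([0, 1], repeat=n):
    s = states[list(conf)]
    weight = np.exp(h @ s + J @ (s[:-1] * s[1:]))
    for i, c in enumerate(conf):
        exact[i, c] += weight
exact /= exact.sum(axis=1, keepdims=True)

print(np.max(np.abs(bp - exact)))
```

On a graph with loops, such as the densely connected factor graphs considered in this paper, the same message updates no longer come with this guarantee.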

Nonetheless, the BP is known to give excellent approximation performance in the case of sparsely connected graphs and has been successfully applied, for example, to decoding low-density parity-check (LDPC) codes. On the other hand, it is known that the approximation given by the BP in the case of more densely connected graphs is sometimes less reliable. Problems such as suboptimal solutions or outright convergence failures arise.

However, despite those issues, investigating the BP algorithm on such densely connected schemes is still interesting from a statistical physics point of view and provides valuable insight into the solution space structure of this kind of system. So far, only the BP-based encoders for lossy compression, based on the low-density generator-matrix (LDGM) code^^^ and on the simple perceptron,^^^ have been discussed. Neither BP-based encoders for lossy compression based on the multilayer perceptron (MLP) nor BP-based decoders for error correcting codes based on the MLP have been investigated yet. In this paper, we discuss whether the BP can give practical algorithms in both error correcting codes and lossy compression based on the MLP.

The paper is organized as follows. Section 2 introduces the non-monotonic tree-like multilayer perceptron networks used throughout the paper. Section 3 presents the frameworks of error correcting codes and lossy compression. Section 4 introduces the belief propagation algorithm, and Section 5 states the results obtained by the algorithm in both schemes. Section 6 is devoted to discussion and conclusion.

2. Structure of multilayer perceptrons

In this section we introduce the kind of network we will use throughout the paper. Tree-like perceptrons have already been studied thoroughly by the machine learning community over the years. It is known that a feed-forward network with a single hidden layer made of sufficiently many units is able to implement any Boolean function between the input layer and the output.

The use of perceptron-like networks for problems of information theory was already proposed by Hosaka et al.^^^ They used a simple perceptron to investigate a lossy compression scheme. One of the most interesting features of their work was the use of the following non-monotonic transfer function for the perceptron:



\[
f_k(x) =
\begin{cases}
1, & |x| \le k \\
-1, & |x| > k
\end{cases}
\tag{1}
\]



where $k$ is a real parameter controlling the bias of the output sequence. This choice of a non-monotonic transfer function was inspired by previous well-known results within the machine learning community, such as the improved storage capacity achieved by non-monotonic networks. They chose this modified version of a reversed wedge perceptron (see^^^ for a description of those networks) for several reasons. The first was the need to control the bias of the output sequence easily (which is achieved by tuning the parameter $k$). The second was the claim that a zero Edwards-Anderson (EA) order parameter is needed, reflecting optimal compression within the codeword space (meaning that codewords are uncorrelated in the codeword space). The use of (1) ensures mirror symmetry ($f_k(x) = f_k(-x)$) and is likely to give rise to a zero EA order parameter (see^^^).
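The two properties just described, bias control through $k$ and mirror symmetry, can be illustrated with a short sketch. This is an illustration under assumptions rather than code from the paper: the local field is modeled as a standard Gaussian variable, the boundary case $|x| = k$ is assigned to $+1$, and the values of `k` are arbitrary.

```python
import numpy as np

def f_k(x, k):
    """Non-monotonic transfer function: +1 for |x| <= k, -1 otherwise.

    Assigning the boundary |x| = k to +1 is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= k, 1, -1)

rng = np.random.default_rng(0)
fields = rng.standard_normal(100_000)   # assumed Gaussian local fields

# Larger k pushes the output sequence towards +1, smaller k towards -1.
for k in (0.3, 0.7, 1.5):
    fraction_plus = np.mean(f_k(fields, k) == 1)
    print(f"k = {k}: fraction of +1 outputs = {fraction_plus:.3f}")

# Mirror symmetry: the output is unchanged when the field changes sign.
print(np.array_equal(f_k(fields, 0.7), f_k(-fields, 0.7)))
```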

Subsequently, non-monotonic tree-like perceptrons were successfully used in a lossy compression scheme and an error correcting code scheme using the same kind of non-monotonic transfer function.^' This paper uses the same networks, which are all derived from the general architecture given in Figure 1.

In each of these networks, the coupling vector $s$ is split into $s = (s_1, \ldots, s_l, \ldots, s_K)$, where each $s_l = (s_{l1}, \ldots, s_{li}, \ldots, s_{l\,N/K})$ is an $N/K$-dimensional binary vector of Ising variables (i.e., $\pm 1$ elements). In the same way, the input vector $x^\mu = (x_1^\mu, \ldots, x_l^\mu, \ldots, x_K^\mu)$, $\mu \in \{1, \ldots, M\}$, is also made of $N/K$-dimensional binary vectors $x_l^\mu = (x_{l1}^\mu, \ldots, x_{li}^\mu, \ldots, x_{l\,N/K}^\mu)$ of Ising variables. The output of the network is then given by the scalar $y^\mu$, which is also $\pm 1$. The sgn function denotes the sign function, taking $1$ for $x \ge 0$ and $-1$ for $x < 0$. We use the Ising expression (bipolar expression) $\{1, -1, \times\}$ instead of the Boolean expression $\{0, 1, + \pmod 2\}$ to simplify calculation. Consequently, the Boolean 0 is mapped onto 1 in the Ising framework, while the Boolean 1 is mapped onto $-1$. This mapping can be used without any loss of generality. We investigate three different networks, which are given as follows:

(I) Multilayer parity tree with non-monotonic hidden units (PTH).

\[
y^\mu(s) = \prod_{l=1}^{K} f_k\!\left( \sqrt{\frac{K}{N}}\, s_l \cdot x_l^\mu \right)
\tag{2}
\]



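(II) Multilayer committee tree with non-monotonic hidden units (CTH).

\[
y^\mu(s) = \mathrm{sgn}\!\left[ \sum_{l=1}^{K} f_k\!\left( \sqrt{\frac{K}{N}}\, s_l \cdot x_l^\mu \right) \right]
\tag{3}
\]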



Note that in this case, if the number of hidden units $K$ is even, the argument of the sign function can be 0. We avoid this ambiguity by considering only an odd number of hidden units for the committee tree with non-monotonic hidden units in the sequel.

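(III) Multilayer committee tree with a non-monotonic output unit (CTO).

\[
y^\mu(s) = f_k\!\left( \sqrt{\frac{1}{K}} \sum_{l=1}^{K} \mathrm{sgn}\!\left( \sqrt{\frac{K}{N}}\, s_l \cdot x_l^\mu \right) \right)
\tag{4}
\]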
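The three architectures can be sketched in a few lines. This is an illustration only: the scaling $\sqrt{K/N}$ of the local fields follows the notation used later in the text, while the $1/\sqrt{K}$ scaling in front of the CTO output unit, the boundary convention of `f_k`, and all function names are assumptions of this sketch.

```python
import numpy as np

def f_k(x, k):
    """Non-monotonic unit: +1 for |x| <= k, -1 otherwise (boundary is an assumption)."""
    return np.where(np.abs(x) <= k, 1, -1)

def sgn(x):
    """Sign function taking +1 at zero, as in the text."""
    return np.where(x >= 0, 1, -1)

def local_fields(s, x, K):
    """sqrt(K/N) * s_l . x_l for each of the K blocks of size N/K."""
    N = s.size
    return np.sqrt(K / N) * (s.reshape(K, -1) * x.reshape(K, -1)).sum(axis=1)

def pth(s, x, K, k):
    """Parity tree with non-monotonic hidden units."""
    return int(np.prod(f_k(local_fields(s, x, K), k)))

def cth(s, x, K, k):
    """Committee tree with non-monotonic hidden units (K odd)."""
    return int(sgn(np.sum(f_k(local_fields(s, x, K), k))))

def cto(s, x, K, k):
    """Committee tree with a non-monotonic output unit (assumed 1/sqrt(K) scaling)."""
    return int(f_k(np.sum(sgn(local_fields(s, x, K))) / np.sqrt(K), k))

rng = np.random.default_rng(0)
N, K, k = 15, 3, 0.8
s = rng.choice([-1, 1], size=N)
x = rng.choice([-1, 1], size=N)
print(pth(s, x, K, k), cth(s, x, K, k), cto(s, x, K, k))
```

For $K = 1$ the parity tree and the committee tree with non-monotonic hidden units reduce to the same single non-monotonic perceptron.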




3. Frameworks

3.1 Error correcting codes using multilayer perceptrons

In this section we show how non-monotonic tree-like perceptrons can be used in an error correcting code scheme.

J. Phys. Soc. Jpn. Full Paper

Fig. 1. General architecture of the tree-like multilayer perceptron with $N$ input units and $K$ hidden units.

In a general scheme, an original message $s^0 \in \{-1,1\}^N$ of size $N$ is encoded into a codeword $y_0 \in \{-1,1\}^M$ of size $M$ by some encoding device. The aim of this stage is to add redundancy to the original data. Therefore, we necessarily have $M > N$. Based on this redundancy, a proper decoder should be able to recover the original data even if it was corrupted by noise in the transmission channel. The quantity $R = N/M$ is called the code rate and evaluates the trade-off between redundancy and codeword size. The codeword $y_0$ is then fed into a channel where the bits are subject to noise. The received corrupted message $y \in \{-1,1\}^M$ (which is also $M$-dimensional) is then decoded using its redundancy to infer the original $N$-dimensional message $s^0$. In other words, in a Bayesian framework, one tries to maximize the following posterior probability:



\[
P(s \mid y) \propto P(y \mid s)\, P(s).
\tag{5}
\]



As data transmission is costly, one generally wants to ensure error-free transmission while transmitting as few bits as possible. In other words, one wants to ensure error-free transmission while keeping the code rate as close as possible to the Shannon bound.

In this paper we assume that the original message $s^0$ is uniformly distributed on $\{-1,1\}^N$ and that all the bits are independently generated, so that we have

\[
P(s^0) = \frac{1}{2^N}.
\tag{6}
\]

The channel considered is the Binary Asymmetric Channel (BAC), where each bit is flipped independently of the
others with asymmetric probabilities. If the original bit 
fed into the channel is 1, then it is flipped with probability $p$. Conversely, if the original bit is $-1$, it is flipped with probability $r$. Figure 2 shows the BAC properties.
The binary symmetric channel (BSC) corresponds to the 
particular case where r = p. 
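A channel of this type is easy to simulate. The sketch below is illustrative only; which input symbol is paired with $p$ and which with $r$ is an assumption here, chosen so that the simulation agrees with the conditional probability $P(y^\mu \mid y_0^\mu) = \frac{1}{2} + \frac{y^\mu}{2}[(1 - r - p) y_0^\mu + (r - p)]$ used in this section. The function names are hypothetical.

```python
import numpy as np

def cond_prob(y, y0, p, r):
    """P(y | y0) = 1/2 + (y/2) * [(1 - r - p) * y0 + (r - p)] for y, y0 in {-1, +1}."""
    return 0.5 + 0.5 * y * ((1 - r - p) * y0 + (r - p))

def bac(y0, p, r, rng):
    """Binary asymmetric channel.

    Assumed convention: a +1 bit is flipped with probability p,
    a -1 bit is flipped with probability r.
    """
    y0 = np.asarray(y0)
    flip_prob = np.where(y0 == 1, p, r)
    flips = rng.random(y0.shape) < flip_prob
    return np.where(flips, -y0, y0)

rng = np.random.default_rng(0)
p, r = 0.1, 0.2
y0 = rng.choice([-1, 1], size=200_000)
y = bac(y0, p, r, rng)

# Empirical flip rates should be close to the formula above.
print("flip rate of +1 bits:", np.mean(y[y0 == 1] == -1), "vs", cond_prob(-1, 1, p, r))
print("flip rate of -1 bits:", np.mean(y[y0 == -1] == 1), "vs", cond_prob(1, -1, p, r))
```

Setting `p == r` recovers the binary symmetric channel.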

Finally, the corrupted message $y$ is received at the output of the channel. The goal is then to recover $s^0$ using $y$. The state of the estimated message is denoted by the vector $\hat{s}$. The general outline of the scheme is shown in Figure 3. From Figure 2 we can easily derive the following conditional probability,

\[
P(y^\mu \mid y_0^\mu) = \frac{1}{2} + \frac{y^\mu}{2} \left[ (1 - r - p)\, y_0^\mu + (r - p) \right],
\]

where we make use of the notations $y_0 = (y_0^1, \ldots, y_0^\mu, \ldots, y_0^M)$ and $y = (y^1, \ldots, y^\mu, \ldots, y^M)$. Since we assume that the bits are flipped independently, we deduce $P(y \mid y_0) = \prod_{\mu=1}^{M} P(y^\mu \mid y_0^\mu)$. To encode the original message $s^0$ into a codeword $y_0$, we make use of the non-monotonic tree-like parity machine or committee machine neural networks already introduced. We prepare a set of $M$ input vectors $\{x^1, \ldots, x^\mu, \ldots, x^M\}$ which are drawn independently and uniformly on $\{-1,1\}^N$. This will play the role of the codebook. The original message $s^0$ is used as the coupling vector of the network. Then, each input vector $x^\mu$ is fed sequentially into the network, generating a corresponding scalar $y_0^\mu$ at the output of the network, finally resulting in an $M$-dimensional vector $y_0$. This gives us the codeword to feed into the channel.

Fig. 2. The Binary Asymmetric Channel (BAC)
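The encoding step can be sketched as follows for the parity tree. The network form (a product of non-monotonic hidden units with local fields scaled by $\sqrt{K/N}$), the function name `encode_pth`, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def encode_pth(s, X, K, k):
    """Encode the message s with a parity tree of K non-monotonic hidden units.

    X has shape (M, N): one random input vector (codebook row) per codeword bit.
    """
    M, N = X.shape
    blocks_x = X.reshape(M, K, N // K)
    blocks_s = s.reshape(1, K, N // K)
    h = np.sqrt(K / N) * (blocks_x * blocks_s).sum(axis=2)   # local fields, shape (M, K)
    hidden = np.where(np.abs(h) <= k, 1, -1)                 # non-monotonic units
    return np.prod(hidden, axis=1)                           # parity of the hidden layer

rng = np.random.default_rng(1)
N, M, K, k = 12, 48, 3, 0.8             # hypothetical sizes and threshold
s0 = rng.choice([-1, 1], size=N)        # original message, used as coupling vector
X = rng.choice([-1, 1], size=(M, N))    # random codebook
y0 = encode_pth(s0, X, K, k)            # codeword fed into the channel

print("code rate R =", N / M)
print("codeword:", y0)
```

Because every codeword bit depends on the whole message, flipping the sign of the message leaves the codeword unchanged, which is the mirror symmetry discussed in Section 2.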

The use of random input vectors is known to maximize the storage capacity of perceptron networks, and since each $y_0^\mu$ is computed using the whole set of original bits $s^0$, redundancy is added to the codeword. This makes this kind of scheme promising for the error correcting task. A formal decoder should be able to decode the received corrupted message $y$ by maximizing the posterior probability $P(s \mid y)$, that is,

\[
\operatorname*{argmax}_{s} P(s \mid y).
\tag{7}
\]



To keep the notation as general as possible, as long as explicit use of the encoder is not necessary in computations, we will denote the transformation performed on the vector $s$ by the respective tree-like perceptrons using the notation $\mathcal{F}_k(\sqrt{K/N}\, s_l \cdot x_l^\mu)$. Here $\mathcal{F}_k$ takes a different expression for the three different types of network, and this notation means that all encoders depend on a real threshold parameter $k$.

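Since the relation between an arbitrary message $s$ and the codeword fed into the channel is deterministic, for any $s$ we can write $P(y \mid s) = \prod_{\mu=1}^{M} \{ \frac{1}{2} + \frac{y^\mu}{2} [ (1 - r - p)\, \mathcal{F}_k(\sqrt{K/N}\, s_l \cdot x_l^\mu) + (r - p) ] \}$. We finally get the explicit expression of the joint probability of the model as

\[
P(y, s) = \frac{1}{2^N} \prod_{\mu=1}^{M} \left\{ \frac{1}{2} + \frac{y^\mu}{2} \left[ (1 - r - p)\, \mathcal{F}_k\!\left( \sqrt{\frac{K}{N}}\, s_l \cdot x_l^\mu \right) + (r - p) \right] \right\}.
\tag{8}
\]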
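To make the cost of a formal decoder concrete, the sketch below enumerates all $2^N$ candidate messages and keeps the one with the largest likelihood, which coincides with the maximum of the posterior under the uniform prior. It is only feasible for very small $N$. The parity-tree encoder, the channel convention (a $+1$ bit flipped with probability $p$), and all names and parameter values are assumptions of this illustration.

```python
import itertools
import numpy as np

def encode_pth(s, X, K, k):
    """Parity tree with non-monotonic hidden units (assumed sqrt(K/N) field scaling)."""
    M, N = X.shape
    h = np.sqrt(K / N) * (X.reshape(M, K, N // K) * s.reshape(1, K, N // K)).sum(axis=2)
    return np.prod(np.where(np.abs(h) <= k, 1, -1), axis=1)

def log_likelihood(s, y, X, K, k, p, r):
    """log P(y | s) for the binary asymmetric channel."""
    F = encode_pth(s, X, K, k)
    probs = 0.5 + 0.5 * y * ((1 - r - p) * F + (r - p))
    return float(np.sum(np.log(probs)))

def map_decode(y, X, K, k, p, r):
    """Exhaustive search over all 2^N messages (uniform prior, so MAP = ML)."""
    N = X.shape[1]
    best_s, best_ll = None, -np.inf
    for bits in itertools.product([-1, 1], repeat=N):
        s = np.array(bits)
        ll = log_likelihood(s, y, X, K, k, p, r)
        if ll > best_ll:
            best_s, best_ll = s, ll
    return best_s, best_ll

rng = np.random.default_rng(2)
N, M, K, k, p, r = 8, 40, 2, 0.8, 0.1, 0.2    # hypothetical toy parameters
s0 = rng.choice([-1, 1], size=N)
X = rng.choice([-1, 1], size=(M, N))
y0 = encode_pth(s0, X, K, k)
flip = rng.random(M) < np.where(y0 == 1, p, r)
y = np.where(flip, -y0, y0)

s_hat, _ = map_decode(y, X, K, k, p, r)
print("overlap with the original message:", s_hat @ s0 / N)
```

Because of the mirror symmetry of $f_k$, the messages $s$ and $-s$ have the same likelihood, so the overlap printed here can be negative even when decoding succeeds up to a global sign.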



The typical performance of this scheme was already studied using the replica method (RM),^^ and it was shown that each of the three proposed networks can reach the optimal Shannon bound in the infinite codeword length limit (when $N \to \infty$ and $M \to \infty$ while the code rate $R$ is kept finite) under some specific conditions.

The PTH and the CTH were shown to reach the Shannon bound for any number of hidden units $K$ (any odd number of hidden units in the case of the CTH) if the threshold parameter $k$ of the non-monotonic transfer function is properly tuned. The CTO was shown to reach the Shannon bound only when its number of hidden units $K$ becomes infinite and with a properly tuned threshold parameter $k$.

Fig. 3. Layout of the error correcting code scheme

3.2 Lossy compression using multilayer perceptrons

In this section we introduce the framework of lossy data compression^ and show how non-monotonic tree-like perceptrons can be used for this purpose.

Let $y$ be a discrete random variable defined on a source alphabet $\mathcal{Y}$. An original source message is composed of $M$ random variables, $y = (y^1, \ldots, y^M) \in \mathcal{Y}^M$, and compressed into a shorter expression. The encoder compresses the original message $y$ into a codeword $s$, using the transformation $s = \mathcal{F}(y) \in \{-1,1\}^N$, where $N < M$. The decoder maps this codeword $s$ onto the decoded message $\hat{y}$, using the transformation $\hat{y} = \mathcal{G}(s) \in \hat{\mathcal{Y}}^M$. The encoding/decoding scheme can be represented as in Figure 4. In this case, the code rate is defined by $R = N/M$. A distortion function $d$ is defined as a mapping $d : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}^+$. With each possible pair $(y, \hat{y})$, it associates a positive real number. In most cases, the reproduction alphabet $\hat{\mathcal{Y}}$ is the same as the alphabet $\mathcal{Y}$ on which the original message $y$ is defined.

Hereafter, we set $\hat{\mathcal{Y}} = \mathcal{Y}$, and we use the Hamming distortion as the distortion function of the scheme. This distortion function is given by

\[
d(y, \hat{y}) =
\begin{cases}
0, & y = \hat{y} \\
1, & y \ne \hat{y}
\end{cases}
\tag{9}
\]



so that the quantity $d(y, \hat{y}) = \sum_{\mu=1}^{M} d(y^\mu, \hat{y}^\mu)$ measures how far the decoded message $\hat{y}$ is from the original message $y$. In other words, it records the error made on the original message during the encoding/decoding process. The probability of error distortion can be written $E[d(y, \hat{y})] = P[y \ne \hat{y}]$, where $E$ represents the expectation. Therefore, the distortion associated with the code is defined as $D = E[\frac{1}{M} d(y, \hat{y})]$, where the expectation is taken with respect to the probability distribution $P[y, \hat{y}]$. $D$ corresponds to the average error per variable $\hat{y}^\mu$. A rate distortion pair $(R, D)$ is said to be achievable if there exists a coding/decoding scheme $(\mathcal{F}, \mathcal{G})$ such that $E[\frac{1}{M} d(y, \hat{y})] \le D$ in the limit $M \to \infty$ and $N \to \infty$ (note that the rate $R$ is kept finite).
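For $\pm 1$ sequences, the per-bit Hamming distortion of a single message pair is simply the fraction of disagreeing positions, and it is related to the overlap by $\frac{1}{M} d(y, \hat{y}) = \frac{1}{2}(1 - \frac{1}{M} y \cdot \hat{y})$. The short sketch below, with hypothetical function names and toy data, illustrates both forms.

```python
import numpy as np

def hamming_distortion(y, y_hat):
    """Per-bit Hamming distortion (1/M) * sum_mu d(y^mu, y_hat^mu)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.mean(y != y_hat))

def distortion_from_overlap(y, y_hat):
    """Same quantity for +-1 sequences, written through the overlap."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(0.5 * (1.0 - np.dot(y, y_hat) / y.size))

y = np.array([1, -1, 1, 1])         # toy original message
y_hat = np.array([1, 1, 1, -1])     # toy reconstructed message
print(hamming_distortion(y, y_hat), distortion_from_overlap(y, y_hat))
```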

The optimal compression performance that can be ob- 
tained in the framework of lossy compression is given by 
the so-called rate distortion function R{D) which gives 
the best achievable code rate as a function of D (Shan- 
non bound for lossy compression). However, despite the 
fact that the best achievable performance is known, as in 
the error correcting code case, no clues are given about 
how to construct such an optimal compression scheme. 

In this paper we assume that the original message $y = (y^1, \ldots, y^\mu, \ldots, y^M)$ is generated independently by an identically biased binary source, so that we can easily write the corresponding probability distribution,

\[
P[y^\mu] = p\, \delta(y^\mu - 1) + (1 - p)\, \delta(y^\mu + 1),
\tag{10}
\]

where $p$ corresponds to the bias parameter. The encoder is simply defined as follows:

\[
\mathcal{F}(y) = \operatorname*{argmin}_{s \in \{-1,1\}^N} d(y, \mathcal{G}(s)).
\tag{11}
\]



Next, to decode the compressed message $s$ we make use of the already introduced tree-like perceptrons. As in the error correcting code scheme, we prepare a set of $M$ input vectors $\{x^1, \ldots, x^\mu, \ldots, x^M\}$ which are drawn independently and uniformly on $\{-1,1\}^N$. This will play the role of the codebook. The compressed message $s$ is used as the coupling vector of the network. Then, each input vector $x^\mu$ is fed sequentially into the network, generating a corresponding scalar $\hat{y}^\mu$ at the output of the network, finally resulting in an $M$-dimensional vector $\hat{y}$. This gives us the reconstructed message, which should satisfy $E[\frac{1}{M} d(y, \hat{y})] \le D$, where $D$ is the desired fidelity criterion which measures the amount of error between the reconstructed message $\hat{y}$ and the original message $y$.

To keep the notation as general as possible, as long as explicit use of the decoder is not necessary in computations, we will again denote the transformation performed on the vector $s$ by the respective tree-like perceptrons using the notation $\mathcal{F}_k(\sqrt{K/N}\, s_l \cdot x_l^\mu)$.

The encoding phase can be viewed as a classical perceptron learning problem, where one tries to find the weight vector $s$ which minimizes the distortion function $d(y, \hat{y})$ for the original message $y$ and the random input vectors $x^\mu$. The vector $s$ which achieves this minimum gives us the codeword to be sent to the decoder. Therefore, in the case of a lossless compression scheme (i.e., $D = 0$), evaluating the rate distortion property of the present scheme is equivalent to finding the number of couplings $s$ which satisfy the input/output relation $x^\mu \mapsto y^\mu$. In other words, this is equivalent to the calculation of the storage capacity of the network.^^'^^^
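The formal encoder can be written down directly as an exhaustive search, which makes its exponential cost visible. The sketch below is an illustration under assumptions: it uses the simplest decoder ($K = 1$, a single non-monotonic perceptron), hypothetical function names, and toy sizes for which enumerating all $2^N$ codewords is still feasible.

```python
import itertools
import numpy as np

def decode(s, X, k):
    """K = 1 decoder: y_hat^mu = f_k(s . x^mu / sqrt(N))."""
    N = X.shape[1]
    h = X @ s / np.sqrt(N)
    return np.where(np.abs(h) <= k, 1, -1)

def distortion(y, y_hat):
    """Per-bit Hamming distortion."""
    return float(np.mean(y != y_hat))

def encode(y, X, k):
    """Formal encoder: search all 2^N codewords for the minimum distortion."""
    N = X.shape[1]
    best_s, best_d = None, np.inf
    for bits in itertools.product([-1, 1], repeat=N):
        s = np.array(bits)
        d = distortion(y, decode(s, X, k))
        if d < best_d:
            best_s, best_d = s, d
    return best_s, best_d

rng = np.random.default_rng(3)
N, M, k, p = 10, 20, 0.8, 0.5               # hypothetical toy parameters
X = rng.choice([-1, 1], size=(M, N))        # random codebook
y = np.where(rng.random(M) < p, 1, -1)      # source message with P[y = +1] = p

s_best, d_best = encode(y, X, k)
s_rand = rng.choice([-1, 1], size=N)
print("R =", N / M)
print("distortion of the best codeword:", d_best)
print("distortion of a random codeword:", distortion(y, decode(s_rand, X, k)))
```

The loop visits $2^N$ candidates, so this approach is limited to toy sizes; this is the motivation for looking at BP-based encoders.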

The typical performance of this scheme was already studied using the replica method (RM),^^^ and it was shown that each of the three proposed networks can reach the optimal Shannon bound in the infinite codeword length limit (when $N \to \infty$ and $M \to \infty$ while the code rate $R$ is kept finite) under some specific conditions.

The PTH and the CTH were shown to reach the Shannon bound for any number of hidden units $K$ (any odd number of hidden units in the case of the CTH) if the threshold parameter $k$ of the non-monotonic transfer function is properly tuned. The CTO was shown to reach the Shannon bound only when its number of hidden units $K$ becomes infinite and with a properly tuned threshold parameter $k$.

4. Belief-propagation-based algorithms 

In this section we briefly introduce the BP algorithm and how it can be used to infer an approximation of the marginalized posterior probabilities. The BP or sum-product algorithm was originally designed to compute exact marginalization on a factor graph which is a tree. However, it is known to give very good performance even for non-tree factor graphs in various cases. For a formal introduction to the BP algorithm, see.^^'^^^

Fig. 4. Rate distortion encoder and decoder.

So far, the BP algorithm has been applied by Hosaka et al. in the case of lossy compression using the simple perceptron,^^^ but not in the case of the MLP. We follow in the footsteps of their work to investigate both the BP decoder for error correcting codes and the BP encoder for lossy compression, which are based on multilayer perceptrons. It should be noted that a basic BP algorithm for error correcting codes and lossy compression can be discussed at the same time. For the BP to be used, we need a factorizable probability distribution. Based on the statistical mechanical framework used in the references,^' the posterior probability in each case (in both the error correcting code scheme and the lossy compression scheme) can be represented by a Boltzmann distribution



\[
P(s \mid y, \{x\}; \beta) = \frac{\exp[-\beta \mathcal{H}(s, y, \{x\})]}{Z(y, \{x\}; \beta)},
\tag{12}
\]

where $\mathcal{H}(s, y, \{x\})$ denotes the relevant Hamiltonian and $Z(y, \{x\}; \beta)$ the relevant partition function. The notation $\{x\}$ denotes the fact that the random vectors $x^\mu$ are already fixed and known, i.e., they are quenched random variables.

In order to use the BP algorithm, this Boltzmann distribution can be factorized such that the Boltzmann factor can be decomposed into

\[
\exp[-\beta \mathcal{H}(s, y, \{x\})] = \prod_{\mu=1}^{M} G_{k,\mu}\!\left( \left\{ \sqrt{\frac{K}{N}}\, s_l \cdot x_l^\mu \right\} \right),
\tag{13}
\]

where the expression of the function $G_{k,\mu}$ depends on the scheme considered. In Appendix A, the derivation of the BP-based decoders for error correcting codes is given. In Appendix B, the BP-based encoders for lossy compression are derived. Following from this factorization, we can write down the factor graph representation of the Boltzmann distribution as a bipartite graph (Figure 5). In the BP, it is assumed that the secondary contribution of a single variable $s_{li}$ is small and can be neglected. Under this assumption, the factor graph shown in Fig. 5 is regarded as having a tree-like architecture. Now let us write down the set of messages flowing from the source sequence to the codeword and vice versa. We then have the following equations:


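Fig. 5. Factor graph of the Boltzmann distribution

\[
\rho^{t+1}_{li \to \mu}(s_{li}) = C_{\mu li}\, q_{li}(s_{li}) \prod_{\mu' \ne \mu} \hat{\rho}^{t}_{\mu' \to li}(s_{li}),
\tag{14}
\]

\[
\hat{\rho}^{t}_{\mu \to li}(s_{li}) \propto \sum_{s \setminus s_{li}} G_{k,\mu}\!\left( \left\{ \sqrt{\frac{K}{N}}\, s_l \cdot x_l^\mu \right\} \right) \prod_{(l',i') \ne (l,i)} \rho^{t}_{l'i' \to \mu}(s_{l'i'}),
\tag{15}
\]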



where $C_{\mu li}$ denotes the relevant normalization constant and $q_{li}(s_{li})$ denotes the prior. $\hat{\rho}^{t}_{\mu \to li}(s_{li})$ denotes the message received by the random variable $s_{li}$ from the source sequence bit $y^\mu$ at time step $t$, and $\rho^{t+1}_{li \to \mu}(s_{li})$ denotes the message sent by the random variable $s_{li}$ to the source sequence bit $y^\mu$ at time step $t + 1$. At time $t + 1$, the pseudo posterior marginal is given as



p'+\s\\y,{x};l3) = 



-Hp 



(s|2/,M;/3) 



M 



where $C_{li}$ denotes the relevant normalization constant. We obtain the BP-based algorithm as follows:



tanh 




2 9',(-l) 



where we have inserted back the term depending on the prior and we put $\rho^{t}(s_{li} \mid y, \{x\}; \beta) = \frac{1}{2}(1 + m^{t}_{li} s_{li})$. Details of the calculation and the definitions of the quantities involved are available in Appendix A. The MPM estimator at time step $t$ is given by

\[
\hat{s}_{li} = \mathrm{sgn}(m^{t}_{li}).
\tag{18}
\]

This BP algorithm requires $O(N^2)$ operations for each step.
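The appearance of a hyperbolic tangent in the update follows from a generic identity for binary variables, which can be checked numerically. If each incoming message is parametrized as $\frac{1}{2}(1 + \hat{m}_\mu s)$ and the prior enters through the ratio $q(+1)/q(-1)$, then the normalized product of messages has magnetization $m = \tanh[\sum_\mu \tanh^{-1}(\hat{m}_\mu) + \frac{1}{2}\ln(q(+1)/q(-1))]$. The sketch below verifies this identity only; it is not the algorithm of Appendix A, and the function names and sample values are hypothetical.

```python
import numpy as np

def combine_messages(m_hat, prior_ratio=1.0):
    """Magnetization from the tanh form of the message product.

    prior_ratio stands for q(+1) / q(-1); 1.0 corresponds to a uniform prior.
    """
    field = np.sum(np.arctanh(m_hat)) + 0.5 * np.log(prior_ratio)
    return float(np.tanh(field))

def combine_direct(m_hat, prior_ratio=1.0):
    """Same magnetization from the explicit, normalized product of messages."""
    plus = prior_ratio * np.prod((1 + m_hat) / 2)    # unnormalized weight of s = +1
    minus = np.prod((1 - m_hat) / 2)                 # unnormalized weight of s = -1
    return float((plus - minus) / (plus + minus))

m_hat = np.array([0.3, -0.5, 0.8, 0.1])              # hypothetical incoming messages
print(combine_messages(m_hat), combine_direct(m_hat))
print(combine_messages(m_hat, 2.5), combine_direct(m_hat, 2.5))
```

The estimate of a bit is then the sign of this magnetization, as in (18).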

5. Empirical performance 

5.1 Error correcting code case 

In this section we show the results we obtain by using the BP algorithm as a decoder of the scheme.

In the case of error correcting codes, the Edwards-Anderson order parameter $q$ is $q = 1$ in the ferromagnetic phase, implying that $|\langle s_{li} \rangle| = 1$, where $\langle \cdots \rangle$ denotes the average with respect to $y$ and $x$. This means that a simple uniform prior can be used efficiently and there is no uncertainty about the sign of $s_{li}$. However, as will be discussed further in the lossy compression case, we introduce a more refined prior, a so-called inertia term, of the following form

\[
q^{t}_{li}(s_{li}) = e^{s_{li} \tanh^{-1}(\gamma m^{t}_{li})},
\tag{19}
\]

where $0 \le \gamma < 1$ denotes the amplitude of the inertia term. Note that $\gamma$ is set by trial and error. This method was already successfully applied by Murayama^^^ for a lossy compression scheme. In the sequel, unless $\gamma$ is explicitly specified, we use $\gamma = 0$, corresponding to the simple uniform prior.

The general procedure is as follows. In each case, the threshold parameter $k$ is set to the optimal theoretical value. First, an original message $s^0$ is generated from the uniform distribution. Then the original message is turned into a codeword $y_0$ using the relevant network. The codeword is then fed into the binary asymmetric channel, where it is corrupted by noise according to the parameters $p$ and $r$. The decoder receives the corrupted codeword $y$ at the output of the channel. The BP is finally used to infer the original message $s^0$ from the corrupted codeword $y$. The BP-based decoders are shown in Appendix A.

We conducted two types of simulations. In the first one, the number of hidden units $K$, the size $N$ of the original message and the parameters $(p, r)$ of the BAC are kept constant. The changing parameter is the size $M$ of the codeword, which results in different values for the code rate $R = N/M$. For each value of $R$ tested, we perform 100 runs. For each run, we perform 100 BP iterations and the resulting estimated message $\hat{s}$ is compared with the original one using the overlap $\frac{1}{N} \hat{s} \cdot s^0$. The mean value of the overlap is plotted against the code rate. The authors are well aware that, in general, information theorists plot the performance of an error correcting code scheme using error probability plots on a logarithmic scale. However, the present BP calculations still require a computational cost of order $O(N^2)$, which makes such plots infeasible. On top of that, the authors believe that the main interest of the present schemes at the present state of research is theoretical rather than practical. The performance plot intends to give a general idea of the typical performance obtained using the BP with these schemes but does not aim at discussing possible practical implementations of these schemes. We believe the performance exhibited by these schemes at the present time to be too limited to be worth such discussion.

In the second type of experiment, we try to shed light on the structure of the solution space. For this purpose, we fix the values of $K$, $N$, $M$, $p$, $r$ and generate an original message $s^0$. We run the BP algorithm and get an estimated message $\hat{s}$ after 100 iterations. Then we keep the same original message and run the BP again but with different initial values. After 100 iterations we get another estimated message $\hat{s}'$. We perform the same procedure 30 times and calculate the average overlap $\frac{1}{N} \hat{s} \cdot \hat{s}'$ between all the obtained estimated messages. Next we generate a new original message and repeat the same procedure for 50 different original messages. We finally plot the obtained overlaps using histograms, thus reflecting the distribution of the solution space.
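The overlap statistics used in this second experiment can be sketched as follows. The BP runs themselves are not reproduced here; instead, two stand-in sets of estimates are assumed for illustration: copies of $\pm s^0$ (two mirror-symmetric attractors) and independent random vectors (uncorrelated solutions). The function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def overlap(a, b):
    """Overlap (1/N) a . b between two +-1 vectors."""
    return float(np.dot(a, b)) / len(a)

def pairwise_overlaps(estimates):
    """Overlaps between all pairs of estimated messages."""
    return [overlap(a, b) for a, b in combinations(estimates, 2)]

rng = np.random.default_rng(4)
N, runs = 1000, 30
s0 = rng.choice([-1, 1], size=N)

# Stand-in for a two-attractor solution space: each run ends in +s0 or -s0.
signs = rng.choice([-1, 1], size=runs)
two_attractors = [sign * s0 for sign in signs]

# Stand-in for uncorrelated solutions: independent random vectors.
uncorrelated = [rng.choice([-1, 1], size=N) for _ in range(runs)]

bins = np.linspace(-1, 1, 21)
print(np.histogram(pairwise_overlaps(two_attractors), bins=bins)[0])
print(np.histogram(pairwise_overlaps(uncorrelated), bins=bins)[0])
```

In the first case the histogram concentrates at $\pm 1$; in the second it concentrates around 0 with a width of order $1/\sqrt{N}$, so it becomes sharper as $N$ grows.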

5.1.1 Parity tree with non-monotonic hidden units (PTH)

We show the results obtained for the PTH with $K = 1$ and $K = 3$ hidden units in Figure 6. The vertical line represents the Shannon bound, that is, the theoretical limit up to which decoding is still successful (i.e., the overlap is 1). The average overlap over 100 trials is plotted. While the Shannon bound gives a theoretical optimal code rate of $R \simeq 0.4$, for $K = 1$ the performance of the BP starts to deteriorate rapidly for $R > 0.25$. This shows the limitation of the BP performance. We tested several configurations with different values for $p$ and $r$ (BSC case and Z channel case), and the general tendency is always the same: far from the Shannon bound, the performance deteriorates rapidly.
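The position of the Shannon bound for a binary asymmetric channel can be estimated numerically by maximizing the mutual information over the input bias. The sketch below uses a simple grid search; the function names are hypothetical, and the values $p = 0.1$, $r = 0.2$ are those quoted in the caption of Figure 6. The result should land near the value $R \simeq 0.4$ mentioned above.

```python
import numpy as np

def h2(q):
    """Binary entropy in bits, clipped to avoid log(0)."""
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

def bac_capacity(p, r, grid=10001):
    """Capacity of the binary asymmetric channel by grid search over the input bias.

    pi is the probability of sending +1; a +1 is flipped with probability p and
    a -1 with probability r (the capacity is unchanged if the roles are swapped).
    """
    pi = np.linspace(0.0, 1.0, grid)
    p_out = pi * (1 - p) + (1 - pi) * r                  # P(y = +1)
    mutual_info = h2(p_out) - pi * h2(p) - (1 - pi) * h2(r)
    return float(mutual_info.max())

print("BAC capacity for p = 0.1, r = 0.2:", round(bac_capacity(0.1, 0.2), 3))
print("BSC capacity for p = r = 0.1:", round(bac_capacity(0.1, 0.1), 3))
```

The optimal input bias is in general not $1/2$ for an asymmetric channel, which is why the ability to tune the output bias through the threshold $k$ matters in these schemes.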

Next, the same experiment with $K = 3$ hidden units shows that the BP fails completely to decode the corrupted codeword. The average overlap is 0 even for low values of $R$. We always got the same results for any value of $p$ and $r$. In fact, for any $K > 1$, it seems that the BP always fails to converge to any relevant solution. This result is surprising and might indicate that the number of suboptimal states is so large that it prevents the BP from working.

Then we try to investigate the structure of the solution space. We plot the histograms of the overlap of the solutions obtained using the BP (when $K = 1$) in Figure 7 (a). In this case, we see that the BP converges to two different solutions with opposite signs, which correspond to $\pm s^0$. This is expected and comes from the mirror symmetry of the function $f_k$. In this case the solution space is simple, with two dominant attractors given by $s^0$ and $-s^0$.

Then we perform the same experiment but with $K = 3$ and $N = 102$. Results are plotted in Figure 7 (b). We obtain a Gaussian-like distribution centered on 0. This means that the solutions given by the BP are almost uncorrelated with each other. They do not correspond to any relevant solution and the empirical overlap is almost 0. We then conduct the same experiment keeping the code rate unchanged but for an original message of 1000 bits. Results are shown in Figure 7 (c). The distribution becomes sharper, centered on 0, meaning that the solutions given by the BP are completely uncorrelated. The number of suboptimal states becomes very large and the BP completely fails to converge to a relevant solution.

To conclude the case of the PTH, we can say that for $K = 1$, the BP converges but with performance far from being Shannon optimal. For $K > 1$, the BP completely fails. This is probably due to a rise in the number of suboptimal states when using more than one hidden unit.

5.1.2 Committee tree with non-monotonic hidden units (CTH)

We show the results obtained for the CTH with $K = 3$ hidden units in Figure 8 (a). We do not show the result for $K = 1$ because in this case, the CTH is equivalent to the PTH. The vertical line represents the Shannon bound. The average overlap over 100 trials is plotted.

Fig. 6. Empirical performance of the BP-based decoder for error correcting codes using the PTH with $K = 1$ (solid) and $K = 3$ (dashed). We set $p = 0.1$ and $r = 0.2$, with $\gamma$ set by trial and error, and used $N = 1000$ (for $K = 1$) and $N = 999$ (for $K = 3$). The vertical line represents the Shannon bound.

Fig. 7. Overlap of the solutions given by the BP-based decoder for error correcting codes using the PTH with $R = 0.25$, $p = 0.1$, $r = 0.2$ and $\gamma = 0$. The horizontal axis of each panel is the overlap $\frac{1}{N} \hat{s} \cdot \hat{s}'$. (a) $K = 1$ and $N = 1000$. The empirical overlap with the original message is 0.97. (b) $K = 3$ and $N = 102$. The empirical overlap with the original message is 0.08. (c) $K = 3$ and $N = 1002$. The empirical overlap with the original message is 0.03.

In this case it is interesting to note that for γ = 0 and
R < 0.15, the BP fails to properly recover the original
message but still seems to converge to some meaningful
state. The average overlap is around 0.75 but never
reaches 1. This is probably due to local suboptimal
attractors in the solution space. Adding a perturbation by
inserting a non-zero inertia term seems to be a good
way to escape those suboptimal states, and the best
performance is obtained for γ around 0.45. However, for
R > 0.15, whatever value γ takes, the performance
quickly deteriorates. The performance is therefore very far
from optimal and suggests that the larger the code
rate is, the larger the number of suboptimal states is.

We then conduct exactly the same experiment but for
K = 5. The results are shown in Figure 8 (b) and are
almost identical to the results obtained for K = 3. We then
compare the best performance obtained
using the CTH. Results are shown in Figure 8 (c).

The best performance is obtained for K = 1. Increasing
the number of hidden units clearly yields poorer
performance. This is an interesting phenomenon, and the only
explanation is that the number of hidden units has a
critical influence on the solution space structure. While
theoretically any number of hidden units should be able
to yield optimal performance, the BP clearly gives bad
results for K > 1. As in the study
introduced in the first part of this paper, it is very likely
that the intrinsic structure of MLPs is at the origin of this
ill behavior. The number of hidden units seems to play a
critical role in the organization of the solution space, and
probably gives rise to some complex geometrical features.

Then we try to investigate the structure of the solution
space. First, we plot the histogram of the overlap of the
solutions obtained using the BP with K = 3, N = 999
and γ = 0 in Figure 9 (a). Then we plot the histogram of
the overlap of the solutions obtained using the BP with
K = 3, N = 999 and γ = 0.45 in Figure 9 (b).

For γ = 0, we obtain seven peaks: two tall peaks at
±1/3, one peak at 0, and four small peaks at ±2/3 and
±1. The peaks located at ±1 and ±1/3 correspond to
successful decoding and reflect the possible combinations
of decoded messages when K = 3. Indeed, because of the
mirror symmetry in the CTH network, any combination
of ±s_l gives the same output. We therefore have an
inherent ambiguity on the original message, which can
be easily removed by adding some simple header to the
codeword.

The two small peaks around ±2/3 correspond to a
partial success in decoding. Indeed, further investigation
showed that those peaks correspond to codewords where
two of the three s_l vectors have been successfully
retrieved but the last vector was not. This means that
the BP remained trapped in some local attractor, which
probably depends on the initial values used by the BP.
The interesting fact is that it affects the BP performance
only partially in this case, showing that for the CTH
the BP dynamics of each s_l is independent of the others
to some extent. Finally, the peak around 0 reflects a
completely unsuccessful decoding. This explains the average
overlap of 0.74.

In the K = 3 system, for a given original message
s^0 = (s_1^0, s_2^0, s_3^0), the eight messages

    S(s^0) \triangleq \{ (s_1^0, s_2^0, s_3^0), (s_1^0, s_2^0, -s_3^0),
                         (s_1^0, -s_2^0, s_3^0), (s_1^0, -s_2^0, -s_3^0),
                         (-s_1^0, s_2^0, s_3^0), (-s_1^0, s_2^0, -s_3^0),
                         (-s_1^0, -s_2^0, s_3^0), (-s_1^0, -s_2^0, -s_3^0) \},

which include the original message, are
mapped into the same codeword. So an additional K
bits of information are necessary to specify the original
message within the set S(s^0). For instance, one possible
choice of additional information is to add 1 to each block l,
as s_l → (1, s_l), to specify the original message (the
length of this information is negligible compared to the length
of the original message). If the BP decoder correctly
estimates the original message, the estimated message
is identical to one of the elements of the set S(s^0) with
equal probability. Therefore, when the BP decoder
estimates correctly, the histogram exhibits only four peaks
located at ±1 (probability 1/8 each) and ±1/3 (probability
3/8 each).
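
The peak positions and weights quoted above can be checked by direct enumeration. The following Python sketch is only an illustration under stated assumptions (it is not the authors' code): the message is split into K blocks of equal length, and each block is recovered up to a global sign drawn uniformly at random, so that the overlap with the original message is the average of the K signs. The function name `overlap_distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def overlap_distribution(K):
    """Return {overlap: probability} over the 2**K block sign patterns."""
    counts = Counter()
    for signs in product((+1, -1), repeat=K):
        # Each block carries N/K bits, all multiplied by the same sign,
        # so the normalized overlap is the mean of the K signs.
        counts[Fraction(sum(signs), K)] += 1
    total = 2 ** K
    return {q: Fraction(c, total) for q, c in sorted(counts.items())}


if __name__ == "__main__":
    for q, p in overlap_distribution(3).items():
        print(f"overlap {q}: probability {p}")
```

Under the same assumptions, calling `overlap_distribution(2)` gives peaks at ±1 (probability 1/4 each) and at 0 (probability 1/2).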

The case γ = 0.45, on the other hand, exhibits
only four peaks located at ±1 and ±1/3. This
means that decoding is always successful in this case,
as confirmed by the average overlap of 0.99. This result
shows that using a non-zero inertia term can be an
efficient way of avoiding suboptimal states by adding a
small perturbation to the BP dynamics.

To conclude for the CTH, it is very clear that using a
number of hidden units greater than 1 is at the origin of
some structural changes of the solution space, which
provoke a dramatic performance drop. For K = 1,
successful decoding is ensured up to R = 0.25, while for K = 3,
successful decoding is ensured up to R = 0.15 only.
However, between K = 3 and K = 5, we observe no
substantial change. It seems that as R increases, the basin
of attraction of the suboptimal states quickly becomes very
large compared to that of the optimal solution. The
influence of the number of hidden units on the solution space
geometry remains to be investigated in future work.

5.1.3 Committee tree with a non-monotonic output unit 
(CTO) 

We show the results obtained for the CTO with K = 2,
K = 3 and K = 5 hidden units in Figure 10. We do
not show the result for K = 1 because the CTO
cannot be defined in this case. The vertical line represents
the Shannon bound. The average overlap for 100 trials
is plotted. For K = 2 the BP successfully decodes the
corrupted codeword up to R = 0.17 only, and beyond this
value the performance gradually decreases. Compared to
the PTH/CTH with K = 1, the CTO performance is
poorer. However, this is not very surprising because, as
mentioned during the analytical study of the CTO case,
the CTO is expected to reach the Shannon bound for an
infinite number of hidden units only, so suboptimal
performance is to be expected for finite K.
For K = 3, the average performance is better
and decoding is successful up to R = 0.2. In this case, it
is worth using an extra unit. However, for K = 5, the
performance deteriorates and we get poorer performance
than K = 1. Nonetheless, the overall performance is still
better than the CTH with the same number of hidden
units.

Then we try to investigate the structure of the solution
space. We plot the histogram of the overlap of the
solutions obtained using the BP with K = 2 and N = 1000 in
Figure 11 (a). We obtain three sharp peaks at ±1 and 0,
and two small peaks around ±2/3. The three sharp peaks
correspond to successful decoding. As in the CTH case,
their positions correspond to the possible combinations
of ±s_l. However, the other two small peaks correspond
to suboptimal states (the average overlap is 0.76) and it
is unclear what the value ±2/3 denotes. It might
correspond to some particular local attractor, which should
be investigated in the future.

Fig. 8. Empirical performance of the BP-based decoder for error
correcting codes using the CTH with p = 0.1 and r = 0.2. The
vertical line represents the Shannon bound. (a) K = 3 and N =
999. The dashed line is for γ = 0, the dotted line is for γ = 0.45.
(b) K = 5 and N = 1000. The dashed line is for γ = 0, the
dotted line is for γ = 0.45. (c) K ∈ {1, 3, 5}: K = 1 (solid),
K = 3 (dashed) and K = 5 (dotted) hidden units. We set
N = 1000 for K = 1, 5 and N = 999 for K = 3. We chose γ = 0
for K = 1 and γ = 0.45 for K = 3, 5, which are set by trial
and error.

Next we plot the histogram of the overlap of the
solutions obtained using the BP with K = 3 and N = 999 in
Figure 11 (b). For K = 3, we obtain two sharp peaks at
±1 and a rather flat distribution connecting them (with
a small concentration around 0). The two sharp peaks
correspond to successful decoding, while the rest of the
distribution indicates suboptimal states. However, here
there are no particular suboptimal states as in the case
K = 2. This particularity is interesting and
remains to be investigated.

To conclude the case of the CTO, we can say that
the BP reaches its best performance for K = 3 but
decoding is still far from being Shannon optimal. However,
as the analytical study already mentioned, the CTO is
expected to yield Shannon performance only when using an
infinite number of hidden units, so it is a little hard
to explain the results of this section. Nevertheless, one
may expect the performance to get better and better as
K increases, but this is not the case, as shown by the
case K = 5. As for the other networks, it is very
likely that the solution space exhibits strange geometrical
features for K > 1 (explaining the rise of suboptimal
states), preventing the BP from converging properly for large
values of K.

Fig. 9. Overlap of the solutions given by the BP-based decoder
for error correcting codes using the CTH with K = 3, N = 999,
M = 6660, R = 0.15, γ = 0, p = 0.1 and r = 0.2. (a) γ = 0. The
empirical overlap with the original message is 0.74. (b) γ = 0.45.
The empirical overlap with the original message is 0.99.

Fig. 10. Empirical performance of the BP-based decoder for error
correcting codes using the CTO with K = 2 (solid), K = 3
(dashed) and K = 5 (dotted). We set p = 0.1, r = 0.2 and γ = 0
(set by trial and error) and used N = 1000 (for K = 2 and
K = 5), N = 999 (for K = 3). The vertical line represents the
Shannon bound.

Fig. 11. Overlap of the solutions given by the BP-based decoder
for error correcting codes using the CTO with K = 2, N = 1000,
M = 4000, R = 0.25, γ = 0, p = 0.1 and r = 0.2. (a) K = 2
and N = 1000. The empirical overlap with the original message
is 0.76. (b) K = 3 and N = 999. The empirical overlap with the
original message is 0.63.

5.2 Lossy compression case 

In this section we show the results obtained by
using the BP algorithm as an encoder of the scheme.

In the case of lossy compression, the Edwards-Anderson
parameter q vanishes, as discussed in the
references,^^ implying that |⟨s_l^i⟩| = 0 (where ⟨...⟩
denotes the average with respect to y and x). This means
that it is not possible to determine the most probable
sign of s_l^i. To avoid this uncertainty we again introduce
a particular prior of the form

    q_{li}(s_l^i) = \exp\left[ s_l^i \tanh^{-1}(\gamma m_{li}^{t}) \right]    (20)

where 0 < γ < 1 denotes the amplitude of the inertia
term. Note that γ is set by trial and error. This method
was already successfully applied by Murayama.^^^

The general procedure is as follows (in each case, the
threshold parameter k is set to the optimal theoretical
value, see the reference^^^). First, an original message y
is generated from the distribution (10). Then the original
message is turned into a codeword s using the BP-based
algorithms shown in Appendix B. The codeword
is subsequently decoded into ŷ using the proper
tree-like multilayer perceptron decoder network. The
distortion between the decoded message ŷ and the original
message y is then computed.

We conducted two types of simulations. In the first
one, the number of hidden units K, the size of the
codeword N and the bias parameter p of the distribution
(10) are kept constant. The changing parameter is the
size of the original message M, which results in different
values for the code rate R = N/M. For each value of R
tested, we perform 100 runs. For each run, we perform
35 BP iterations, and the resulting estimated codeword s
is then decoded into ŷ. The distortion between ŷ and y
is then computed. The code rate is plotted against the
mean value of the distortion.

The second type of experiment is exactly the same as
in the error correcting case. We fix the values of M and p
and generate an original message y. We run the BP
algorithm and get a codeword s after 35 iterations. Then
we keep the same original message and run the BP
again but with different initial values. After 35 iterations
we get another codeword s'. We perform the same
procedure 30 times and we calculate the average overlap
(1/N) s · s' between all the obtained codewords. Next we
generate a new original message y and repeat the same
procedure for 50 different original messages. We finally plot
the obtained average overlap as a histogram, thus reflecting
the distribution of the codeword space.
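
As an illustration of the overlap statistic used in this second type of experiment, here is a minimal Python sketch. It assumes that codewords are stored as lists of ±1 values; the helper `random_codewords` is a hypothetical stand-in that draws independent uniform codewords instead of running the BP, so it only shows what the histogram looks like when the codewords are truly uncorrelated. In that case the overlap has mean 0 and standard deviation 1/sqrt(N), which is consistent with a Gaussian-like histogram that sharpens as N grows.

```python
import random
import statistics
from itertools import combinations


def overlap(s, t):
    """Normalized overlap (1/N) * s . t between two +/-1 vectors."""
    return sum(a * b for a, b in zip(s, t)) / len(s)


def pairwise_overlaps(codewords):
    """Overlaps between all distinct pairs of codewords."""
    return [overlap(s, t) for s, t in combinations(codewords, 2)]


def random_codewords(n_words, N, rng):
    """Stand-in for BP outputs: independent uniform +/-1 vectors."""
    return [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(n_words)]


if __name__ == "__main__":
    rng = random.Random(0)
    for N in (100, 1000):
        q = pairwise_overlaps(random_codewords(30, N, rng))
        # Compare the empirical spread with the 1/sqrt(N) prediction.
        print(N, round(statistics.pstdev(q), 3), "vs", round(N ** -0.5, 3))
```

Replacing `random_codewords` by the 30 codewords returned by repeated BP runs on the same original message would give the quantity that is histogrammed in the figures.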

5.2.1 Parity tree with non-monotonic hidden units 
(PTH) 

We show the results obtained for the PTH with K = 1
and K = 3 hidden units for unbiased messages (i.e.,
p = 0.5) and biased messages with p = 0.8 in Figure
12. The solid line represents the rate distortion function
corresponding to the Shannon bound, that is, the lowest
achievable distortion for a given code rate R. The average
distortion for 100 trials is plotted.
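
The Shannon bound used as a reference curve can be evaluated numerically. The sketch below assumes the standard rate distortion function of a Bernoulli(p) source under Hamming distortion, R(D) = H2(p) - H2(D) for D up to min(p, 1 - p), with the code rate R = N/M measured in bits per message bit. The function names `h2` and `shannon_distortion` are hypothetical; the bisection simply inverts R(D) for a given R.

```python
import math


def h2(x):
    """Binary entropy function in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def shannon_distortion(R, p, tol=1e-12):
    """Lowest distortion D such that h2(p) - h2(D) = R (0 if R >= h2(p))."""
    target = h2(p) - R
    if target <= 0:
        return 0.0
    lo, hi = 0.0, min(p, 1 - p)
    # h2 is increasing on [0, 1/2], so plain bisection is enough.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h2(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for p in (0.5, 0.8):
        print(p, round(shannon_distortion(0.4, p), 3))
```

For R = 0.4 this should give values close to the Shannon bounds quoted in the figure captions for p = 0.5 and p = 0.8.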

For K = 1, the results are quite far from the Shannon
bound for large code rates but approach it for small
ones (for both p = 0.5 and p = 0.8). We find the same
results as in Hosaka et al.^^^ The same tendency
can be shown for biased messages with p < 0.5, but in
those cases, for symmetry reasons, we should use -f_k
as a transfer function, which gives slightly different BP
equations (some signs change). Hence, for simplicity we
restrict the present study to biased messages with p >
0.5.

The results for unbiased messages and K = 3 are
extremely bad, and the BP does not seem to converge to
any relevant codeword. This is surprising. On top of that,
while the performance is also poorer than for K = 1 for
biased messages, it is not as extreme as in the unbiased
case. The reasons for such a behavior are not very clear.
The codeword space structure is again clearly affected
when more than one hidden unit is used, and this is likely to
perturb the BP dynamics. Bias in the original message
seems to be another factor to take into account.

Then we try to investigate the structure of the
codeword space. We plot the histogram of the overlap of the
codewords obtained using the BP for K = 1, N = 100,
R = 0.4 and p = 0.5 in Figure 13 (a). In this case, it
is interesting to note that, contrary to what one might
believe, the BP does not converge to two different solutions.
As discussed in the reference,^^^ for K = 1, we have at least
two optimal codewords ±s. Therefore, one might expect
to see two peaks concentrated around ±1, but this is not
the case. There are two very small peaks around ±1 and
one large peak centered around 0. This implies that
there are many completely uncorrelated codewords which
share very similar distortion properties. To confirm this
conjecture, we perform exactly the same experiment but
with a larger codeword size N = 1000. Results are shown
in Figure 13 (b).

This time, the small peaks around ±1 completely
vanish and we have a Gaussian-like distribution centered
on 0. This confirms the fact that there is a very large
number of uncorrelated codewords sharing the same
distortion properties. This is a surprising result.

We perform the same type of experiment but with K =
3. We first consider unbiased messages (p = 0.5). We
show the result for N = 102 only because there is no
major change with larger values of N in this case. Results
are plotted in Figure 14 (a). We obtain a Gaussian-like
distribution centered on 0. This means that the solutions
given by the BP are almost uncorrelated with each
other. This time they do not correspond to any relevant
solution, as indicated by the empirical distortion, which
is close to 0.5, meaning completely random codewords. It
seems that for K > 1 and for unbiased messages (p =
0.5), the number of suboptimal states becomes very large
and the BP fails to converge to any relevant codeword.

Then we consider biased messages (p = 0.8). Figure 14
(b) shows the results for N = 102. In this case we have
two peaks located at ±1/3 linked by a rather high plateau
and two small peaks at ±1. The peak locations
correspond to the 2^K possible combinations of codewords
ensured by the structure of the network (discussed in
the reference^^^). This means that in many cases the BP
converges to one of these 2^K possible combinations.
However, the rather high plateau centered on 0 shows that
the BP often converges to uncorrelated codewords.
This means that on top of the 2^K codewords sharing the
same distortion properties, we have a large number of
uncorrelated codewords which share rather similar
distortion properties. We decide to investigate the same case
but with a larger value of N. Figure 14 (c) shows the
result for N = 999. This time, the peaks completely
vanish and we obtain a Gaussian-like distribution centered
on 0. The empirical distortion obtained, 0.101, shows that
the BP converges to a relevant solution (even if not
optimal). This shows that as N gets larger, the number of
uncorrelated codewords sharing similar distortion
properties becomes extremely large. This is an interesting
feature. However, the results are not Shannon optimal and
as K increases, the results for biased messages become
smoothly worse and worse. Nevertheless, the reason why
the BP fails to work for unbiased messages when K > 1
is still unclear.

To conclude the case of the PTH, we can say that for
K = 1, the BP converges but with relatively poor
performance. The codeword space exhibits an interesting
structure, showing that many uncorrelated codewords
share very similar distortion properties. As the codeword length
gets larger, the number of these codewords sharing very
similar distortion properties seems to increase
dramatically. For K > 1, the performance smoothly deteriorates
for biased messages, but for nearly unbiased ones the BP
fails. This is probably due to the rise of suboptimal states
when more than one hidden unit is used. The geometrical
structure of the codeword space remains to be
investigated.

5.2.2 Committee tree with non-monotonic hidden units 
(CTH) 

We show the results obtained for the CTH with K = 1,
K = 3 and K = 5 hidden units for unbiased and
biased messages in Figure 15. Recall that when K =
1, the CTH is equivalent to the PTH. The solid lines
represent the rate distortion function corresponding to
the Shannon bound. The average distortion for 100 trials
is plotted.

Fig. 12. Empirical performance of the BP-based encoder for lossy
compression using the PTH with K = 1 and K = 3 for unbiased
messages (p = 0.5, on the top of the figure) and biased messages
(p = 0.8, on the bottom). Dashed lines are for K = 1 and dotted
lines are for K = 3. We used N = 1000 for K = 1, and N = 999
for K = 3. The inertia term γ = 0.45 was set by trial and
error. The solid lines give the Shannon bound. The top one is for
p = 0.5, the bottom one is for p = 0.8.

Fig. 13. Overlap of the solutions given by the BP-based encoder
for lossy compression using the PTH with R = 0.4 and γ = 0.45,
which is set by trial and error. The Shannon bound is 0.15 for
p = 0.5 and 0.057 for p = 0.8. (a) K = 1, N = 100 and
p = 0.5. The empirical distortion over the trial is 0.21. (b) K = 1,
N = 1000 and p = 0.5. The empirical distortion is 0.19.

The results are quite far from the Shannon bound for
large code rates but approach it for small ones.
However, as K increases, the performance smoothly decreases,
implying that the number of suboptimal states steadily
increases with the number of hidden units. Nevertheless,
it should be noted that for unbiased messages, whereas
the BP completely fails in the PTH case for K > 1, this
is not the case here. In the CTH case as well, the
deterioration of the performance is clearly
linked with the number of hidden units.



Fig. 14. Overlap of the solutions given by the BP-based encoder
for lossy compression using the PTH with R = 0.4 and γ = 0.45,
which is set by trial and error. The Shannon bound is 0.15 for
p = 0.5 and 0.057 for p = 0.8. (a) K = 3, N = 102 and
p = 0.5. The empirical distortion is 0.43. (b) K = 3, N = 102
and p = 0.8. The empirical distortion is 0.118. (c) K = 3,
N = 999 and p = 0.8. The empirical distortion is 0.101.

Then we try to investigate the structure of the
codeword space. We consider only K = 3 because K = 1 is
equivalent to the PTH. We plot the histogram of the
overlap of the solutions obtained using the BP in
Figure 16 (a). In this case, for K = 3, we have four peaks:
two small ones around ±1 and two big ones around ±1/3
linked by a plateau. This is the same situation as
the PTH with K = 3, p = 0.8 and N = 102. The
four peaks correspond to the 2^K possible combinations
of codewords ensured by the structure of the network
(discussed in the reference^^^). On the other hand, the
plateau around 0 shows that there are also many
completely uncorrelated codewords which share very similar
distortion properties. To confirm this conjecture, we
perform exactly the same experiment but with a larger
codeword size N = 1002. Results are shown in Figure 16 (b).
This time, the peaks vanish and we have a Gaussian-like
distribution centered on 0. This confirms the fact that
there is a very large number of uncorrelated codewords
sharing the same distortion properties. We have the same
surprising result as in the PTH case.

To conclude the case of the CTH, we can say that
the BP converges but with quite poor performance.
Furthermore, as K increases, the performance smoothly
deteriorates. The codeword space exhibits an interesting
structure, showing that many uncorrelated codewords
share very similar distortion properties. As the codeword
length gets larger, the number of these codewords seems
to increase dramatically. However, the reasons for this
performance deterioration as K gets larger remain unclear.
It is likely that the use of several hidden units induces
structural changes in the codeword space and that these
are responsible for the bad behavior of the BP.

Fig. 15. Empirical performance of the BP-based encoder for lossy
compression using the CTH with K = 1, K = 3 and K = 5 for
unbiased messages (p = 0.5, on the top of the figure) and biased
messages (p = 0.8, on the bottom). Dashed lines are for K = 1,
dotted lines are for K = 3 and dash-dotted lines are for K = 5.
We used N = 1000 for K = 1 and K = 5, and N = 999 for
K = 3. The inertia term γ = 0.4 was set by trial and error. The
solid lines give the Shannon bound. The top one is for p = 0.5,
the bottom one is for p = 0.8.

Fig. 16. Overlap of the solutions given by the BP-based encoder
for lossy compression using the CTH with K = 3, R = 0.4, p = 0.8
and γ = 0.4, which is set by trial and error. The Shannon bound is
0.14. (a) N = 102. The empirical distortion is 0.3. (b) N = 1002.
The empirical distortion is 0.22.

5.2.3 Committee tree with a non-monotonic output unit 
(CTO) 

We show the results obtained for the CTO with K = 2,
K = 3, K = 4 and K = 5 hidden units for unbiased
messages in Figure 17 (a); the CTO cannot be defined
for K = 1. The continuous solid line represents the rate
distortion function corresponding to the Shannon bound.
The average distortion for 100 trials is plotted. The
results for unbiased messages (p = 0.5) are quite similar to
those of the CTH case. The best performance is obtained for the
smallest K and then smoothly deteriorates. However, the
results do not deteriorate steadily (for example, K = 5
gives better performance than K = 4). This is
probably due to the fact that the free energy is a
discontinuous function of K, as shown in the reference.
It is therefore not surprising that the performance does not
evolve smoothly with K. On top of that, recall
that the CTO is expected to give the Shannon optimal
performance only for an infinite number of hidden units
K, so it is understandable that the results are quite far from
being Shannon optimal. However, as K increases, one may expect
the performance to become closer to the Shannon bound,
but this is not the case. This shows again that a larger
number of hidden units clearly penalizes the BP
performance.

Next we perform the same experiment but for biased
messages with p = 0.8. The results are given in Figure 17
(b). The results for biased messages with p = 0.8 exhibit
strange behavior. The best performance for small rates
R < 0.2 is obtained for K = 3, and for R > 0.2 the best
performance is given by K = 4. We have some strange
jumps in performance, for example for K = 2 between
R = 0.3 and R = 0.4 and for K = 5 between R = 0.5 and
R = 0.6. This is probably due to the fact that the tuning of
the threshold parameter k follows a discontinuous
function of D, which can explain this kind of discontinuous
jump. The results are hard to interpret, but we observed
that for K > 5, the general tendency is to get worse
performance. As mentioned earlier, the CTO is expected to
give Shannon optimal performance only for an infinite
number of hidden units K, so it is understandable that the
results are not Shannon optimal, especially for small K. However,
as K increases, one may expect the performance to
become closer to the Shannon bound, but this is not the
case after K = 4. This shows again that a larger number
of hidden units clearly penalizes the BP performance.

Then we try to investigate the structure of the
codeword space. We show only the case K = 2 here
because the other ones are similar. We plot the histogram
of the overlap of the solutions obtained using the BP for
N = 102 in Figure 18 (a). In this case, we have almost
the same picture as in the PTH/CTH case with K = 1:
two small peaks around ±1 and one large plateau around
0. The small peaks correspond to the codewords which
share exactly the same distortion properties, as ensured
by the mirror symmetry of the function f_k (discussed in
the reference^^^). On the other hand, the plateau around
0 shows that there are also many completely
uncorrelated codewords which share very similar distortion
properties. To confirm this conjecture, we perform exactly
the same experiment but with a larger codeword size
N = 1000. Results are shown in Figure 18 (b). This
time, the small peaks around ±1 completely vanish and
we have a Gaussian-like distribution centered on 0. This
confirms the fact that there is a very large number of
uncorrelated codewords sharing the same distortion
properties.

Fig. 17. Empirical performance of the BP-based encoder for lossy
compression using the CTO with K = 2, K = 3, K = 4 and
K = 5 for unbiased messages (p = 0.5). Dashed line is for K = 2,
dotted line is for K = 3, solid line is for K = 4 and dash-dotted
line is for K = 5. We used N = 1000 for K = 2, 5, N = 999 for
K = 3, and N = 1004 for K = 4. The inertia term γ = 0.4 was
set by trial and error. The continuous solid line (bottom) gives
the Shannon bound. The horizontal axis is R. (a) p = 0.5. (b) p = 0.8.

To conclude the case of the CTO, we can say that the
BP converges but with quite poor performance. On top of
that, because of the discontinuous free energy, we observe
some strange behavior such as sudden jumps in performance.
The CTO theoretically gives Shannon performance for
an infinite number of hidden units K, so one may expect
the performance given by the BP to get better and better
as K increases; however, this is not the case. For K > 4,
we generally get poorer and poorer performance, showing
one more time that there is some intimate link between
the BP performance and the number of hidden units.
Finally, as already found for the PTH and CTH, as the
codeword length gets larger, the number of codewords
sharing similar distortion properties seems to increase
dramatically. The geometrical features of the codeword
space remain to be investigated.

6. Conclusion and Discussion 

We have investigated the BP algorithm as a decoder of
an error correcting code scheme based on a tree-like
multilayer perceptron encoder. In the same way, we have
investigated the BP algorithm as a potential encoder of
a lossy compression scheme based on a tree-like multilayer
perceptron decoder. We have discussed whether the
BP can give practical algorithms in these schemes.
Unfortunately, the BP implementations in this kind of
fully connected network show strong limitations, while
the theoretical results seem somewhat promising. Instead,
the behavior of the BP-based algorithms suggests that the
solution space may have a rich and complex structure.

Fig. 18. Overlap of the solutions given by the BP-based encoder
for lossy compression using the CTO with K = 2, R = 0.4,
p = 0.5 and γ = 0.4, which is set by trial and error. The Shannon
bound is 0.15. The horizontal axis is the overlap (1/N) s · s',
from -1 to 1. (a) N = 100. The empirical distortion over the
trial is 0.25. (b) N = 1000. The empirical distortion over the
trial is 0.21.

While these two schemes have been shown to yield the
Shannon optimal performance theoretically (under some
specific conditions, cf. the references^^), they lack a
practical formal decoder and encoder, respectively. The
BP algorithm has been proposed as a way to calculate the
marginalized posterior probabilities of the relevant
Boltzmann factor, but it exhibits poor performance, preventing
this kind of scheme from being practical. The number
of hidden units should be kept as small as possible, as no
gain has been observed by using several of them. While the
precise reasons behind this bad behavior are still unclear
at the present time, there is no doubt that the number
of hidden units has some deep impact on the solution
space of the considered network, as inferred from
the behavior of the BP-based algorithms. It is very probable
that the existence of mirror symmetry in the network is
at the origin of the BP failure. It is also very likely that
a singular structure similar to the one studied in the first
part of this paper prevents the standard BP algorithm
from working efficiently. This underlines the necessity to
investigate the geometrical features of the solution space of the
PTH/CTH/CTO as well as the BP dynamics, to
understand why the BP does not work well when a large K is
used. It would be interesting to investigate the
information geometrical counterpart of the BP algorithm to see
how well it can perform. This remains a future topic
of research.

On the other hand, as discussed in the references, the mirror
symmetry seems to be a key factor in achieving Shannon performance
with perceptron-like networks in the lossy compression case.
Since we have $f_k(s) = f_k(-s)$, one would expect to get two
optimal solutions (when K = 1) or more (due to the possible
combinations of $\pm s_l$), but this is not the case. Using a small
value of N, the expected peaks induced by the structure of the
network are indeed observable, but a large concentration of
uncorrelated codewords is also visible. Using a sufficiently large
N, those peaks completely vanish and one always gets uncorrelated
codewords, trial after trial, demonstrating that a very large
number of uncorrelated codewords share very similar distortion
properties. The origin of this particular structure of the solution
space remains unclear. In the same way, the complete failure of the
BP in the case of the PTH with K > 1 remains to be investigated. We
might be able to investigate such problems by evaluating the
complexity of the systems. This is part of our future work.
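The mirror symmetry mentioned above can be illustrated with a minimal sketch. It assumes the usual form of the non-monotonic transfer function, $f_k(x) = 1$ for $|x| \le k$ and $-1$ otherwise, and a parity-tree output given by the product of the hidden-unit outputs; the function names, the binary inputs and the parameter values are hypothetical choices made only for this illustration.

```python
import math
import random

def f_k(x, k):
    """Non-monotonic transfer function (assumed form): +1 if |x| <= k, else -1."""
    return 1 if abs(x) <= k else -1

def pth_output(s_blocks, x_blocks, k):
    """Parity-tree output: product over hidden units of f_k(sqrt(K/N) s_l . x_l)."""
    K = len(s_blocks)
    N = sum(len(s_l) for s_l in s_blocks)
    scale = math.sqrt(K / N)
    out = 1
    for s_l, x_l in zip(s_blocks, x_blocks):
        field = scale * sum(a * b for a, b in zip(s_l, x_l))
        out *= f_k(field, k)
    return out

# Hypothetical toy sizes: K hidden units, n_per_unit inputs each.
random.seed(0)
K, n_per_unit, k = 3, 7, 0.5
s = [[random.choice((-1, 1)) for _ in range(n_per_unit)] for _ in range(K)]
x = [[random.choice((-1, 1)) for _ in range(n_per_unit)] for _ in range(K)]

# Flip every block, or only the first block: f_k is even, so the output is unchanged.
flipped_all = [[-v for v in s_l] for s_l in s]
flipped_first = [[-v for v in s[0]]] + s[1:]
same_all = pth_output(s, x, k) == pth_output(flipped_all, x, k)
same_first = pth_output(s, x, k) == pth_output(flipped_first, x, k)
print(same_all, same_first)
```

Because each hidden unit depends on its block only through an even function, flipping any subset of blocks leaves the output unchanged, which is why several equivalent solutions would naively be expected.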
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Appendix A: Derivation of the BP decoder for 
error correcting code case 

Since the $s_{il}$ are Ising variables, we can reparameterize the
above probabilities using the corresponding expectation values of
the random variable $s_{il}^t$,



p\s]\y,{x};f3) 



(A-l) 
(A-2) 
(A-3) 



where $m^t_{\mu il}$, $\hat{m}^t_{\mu il}$ and $m^t_{il}$ denote the relevant
expectation values at time step t. Computing the expectations is
easier than computing the messages themselves.

Using the identity $\ln\frac{1+x}{1-x} = 2\tanh^{-1}x$, independently
of the scheme and network considered, one can already easily derive
the following set of equations.



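$$m^{t+1}_{\mu il} = \tanh\left[\sum_{\mu' \neq \mu}^{M} \tanh^{-1}\hat{m}^{t}_{\mu' il} + \frac{1}{2}\ln\frac{q_{il}(1)}{q_{il}(-1)}\right], \qquad \text{(A-4)}$$

$$m^{t+1}_{il} = \tanh\left[\sum_{\mu=1}^{M} \tanh^{-1}\hat{m}^{t}_{\mu il} + \frac{1}{2}\ln\frac{q_{il}(1)}{q_{il}(-1)}\right]. \qquad \text{(A-5)}$$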
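As a rough illustration of (A-4) and (A-5), the following minimal sketch computes both updates from a table of check-to-variable expectations. The function name, the flattening of the index pair (i, l) into a single index and the use of one common prior log-ratio for all variables are assumptions made for brevity, not part of the derivation.

```python
import math

def variable_updates(m_hat, log_prior_ratio=0.0):
    """Variable-side BP updates in the tanh parameterization.

    m_hat[mu][i] : expectations of the check-to-variable messages at step t,
                   each strictly inside (-1, 1).
    log_prior_ratio : ln q(1)/q(-1), taken equal for all variables here
                      (0 corresponds to a uniform prior).
    Returns (m_cav, m_full), where m_cav[mu][i] excludes check mu, as in (A-4),
    and m_full[i] includes every check, as in (A-5).
    """
    M = len(m_hat)
    N = len(m_hat[0])
    half_prior = 0.5 * log_prior_ratio
    # Sum of atanh over all checks, computed once per variable.
    total = [sum(math.atanh(m_hat[mu][i]) for mu in range(M)) for i in range(N)]
    m_full = [math.tanh(total[i] + half_prior) for i in range(N)]
    # Removing one term from the total gives the sum over mu' != mu.
    m_cav = [
        [math.tanh(total[i] - math.atanh(m_hat[mu][i]) + half_prior)
         for i in range(N)]
        for mu in range(M)
    ]
    return m_cav, m_full

m_hat_example = [[0.1, -0.2], [0.3, 0.05]]
m_cav, m_full = variable_updates(m_hat_example)
print(m_cav, m_full)
```

Computing the full sum once and subtracting the excluded term avoids recomputing the sum over the other checks for every pair of indices.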



In the error correcting code case, $G_{k,\mu}$ is given by



=2-'"«{^f[u-.-p)^.({\/f.,-<}) 

Note that we put $\beta = 1$.



(A-6) 



A.1 Parity tree with non-monotonic hidden units
(PTH) 

In the case of the PTH Tk is given by, 

K 



-rSl ' X 



(A-7) 



Applying the Taylor expansion, this can be rewritten as 

K 



(A-8) 



where 



N/K 



/^=J2S44i (A-9) 



and we have neglected the remaining {A^^/|/' / /} of 

order $O(1/\sqrt{N})$. Note that this approximation is justified
by the fact that we suppose $N \to \infty$. For the same reason
we apply the central limit theorem on the A^^ and find, 

Ar,~Ar(A^z,l-ia), (A-10) 

where 



-N/K 



K 



N/K 



nil 



We finally get 



(A-12) 



where 



K 



1 V^^ 



fk{Ki + Kii + ^i\l^-^U) 

K 



(A-13)

and

$$Dx = \frac{dx}{\sqrt{2\pi}}\, e^{-x^{2}/2}. \qquad \text{(A-14)}$$



Using the fact that A^^ is of order 0{l/\fN)^ we expand 
^k^yiii around and get. 



(A-15)

Finally, using (A-1), we get the expression of $\hat{m}^t_{\mu il}$ as
follows:



{A.16) 



Evaluating ^\ J^)) and — ^''xl ' U^^o^ we can explic- 
itly obtain m^-^. 

So using (A-4), (A-5) and (A-16) iteratively until a
fixed point is reached, one should be able to decode the
received corrupted codeword y and recover the original
message . However, this procedure still requires 0{N^) 
operations so one might want to reduce the complexity 
of the algorithm. 
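The iteration described above can be sketched generically as follows. The `update` argument stands in for one full sweep of (A-4), (A-5) and (A-16); the toy map used in the example is a hypothetical stand-in chosen only so that the loop can be run, and the tolerance and iteration cap are likewise assumptions.

```python
import math

def iterate_to_fixed_point(update, state, tol=1e-10, max_iter=1000):
    """Apply `update` repeatedly until the largest change falls below `tol`.

    Returns (state, number_of_steps, converged_flag).
    """
    for step in range(1, max_iter + 1):
        new_state = update(state)
        change = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if change < tol:
            return state, step, True
    return state, max_iter, False

def toy_update(m):
    """Hypothetical contraction, NOT the BP update of the text."""
    return [math.tanh(0.5 * v + 0.1) for v in m]

m_star, steps, converged = iterate_to_fixed_point(toy_update, [0.0, 0.9, -0.9])
print(m_star, steps, converged)
```

In practice the convergence flag matters: when the BP does not settle, the loop stops at the iteration cap and the returned state should not be read as a fixed point.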

For simplicity, we suppose a uniform prior $q_{il}$ hereafter.
However, the results can easily be generalized to more
complex priors. Since $\hat{m}^t_{\mu il}$ is of order $O(1/\sqrt{N})$, we have



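$$\tanh\left[\sum_{\mu' \neq \mu}^{M} \tanh^{-1}\hat{m}^{t}_{\mu' il}\right] \approx m^{t+1}_{il} - \left[1 - \left(m^{t+1}_{il}\right)^{2}\right]\hat{m}^{t}_{\mu il}. \qquad \text{(A-17)}$$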
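The first-order expansion behind (A-17) can be checked numerically with a small sketch. It assumes a uniform prior and compares the exact expression, in which one check is left out of the sum, with the linearized form built from the full expectation; the function names and the sample values are hypothetical.

```python
import math

def cavity_exact(m_hat_col, mu):
    """tanh of the sum of atanh(m_hat) over all checks except mu."""
    return math.tanh(
        sum(math.atanh(v) for j, v in enumerate(m_hat_col) if j != mu)
    )

def cavity_linearized(m_hat_col, mu):
    """First-order form: m_full - (1 - m_full^2) * m_hat[mu]."""
    m_full = math.tanh(sum(math.atanh(v) for v in m_hat_col))
    return m_full - (1.0 - m_full ** 2) * m_hat_col[mu]

# Small messages, mimicking the O(1/sqrt(N)) scaling assumed in the text.
m_hat_col = [0.03, -0.01, 0.02, 0.04, -0.02]
errors = [
    abs(cavity_exact(m_hat_col, mu) - cavity_linearized(m_hat_col, mu))
    for mu in range(len(m_hat_col))
]
print(max(errors))
```

The discrepancy is of second order in the excluded message, so it shrinks quickly as the individual messages become small.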

We can then evaluate the following equations using the 
above approximation. 



K 



N/K 



where 



K/N 



N 



(4)^ (A-18) 



(A.19) 



K 



=1 

N/K 



€i = E (1 - K~u^ (A-20) 



i=l 



(A-21) 



We here insert the lacking term in the partial sum 
(Si'/i ~ of cross term since this should be 

negligible for large N. In the same way we have



where 



-N/K 

E 



A 



-y 



N 



t 



44, (A-22) 



(A-23) 

^(l-K]2)m*^-i<^.(A.24) 



i=l 
N/K 



i=l 



Using these equations, we can rewrite and its 

derivative as a function of sji, 



(A-26) 



Then, because each eji is of order 0(1/V^), we approx- 
imate {[/* 

} by {Ul^u^Vl^-i} where we neglect 
all the terms {4/|/' 7^ which gives 

^k,fiili^u) ^ Uk,fiil{^u)i 
^k,i^il{^il) ~ ^k,i^il{^il)- 

Using this approximation, we get 



m 



(111 



r^il^k,fiil (4)' 



where we put 



( t \ _ ^k,fiili^u) 
k,iJ,il\^il) — Trt f^t \ ' 



(A.27) 
(A-28) 

(A-29) 
(A-30) 



Then once again, because each $\epsilon$ term is of order $O(1/\sqrt{N})$, we
perform the Taylor expansion:



m 



jjbil 



del 



(A-31) 



For simplicity, we hereafter use the following abbrevia- 
tions: 







(A-32) 


^k,nil{^) = ^k,tih 




(A-33) 






(A-34) 
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— ^il'-'k,iily 
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— ^il ^k,iil , 


(A-36) 






Note that because we 


del, 1^^'="- 



which appear in 

neglect all the e\i, and use the value of the above func- 
tions evaluated at only, we can drop the index i. 

So using all these results, we have (we suppose a uniform
prior for simplicity)



m, 
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^tanh \ml,i) 



■ M 
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where we put 
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(A-38) 



Neglecting small order terms, we obtain 



— — ^k,i^l^k,i^l ^k,i^l^k,nl ^ 



(A.39) 



(V^ 

lj,= l ^ k,iJ,lJ 

We therefore obtain the approximated BP equation as 
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follows: 



m^^'^ = tanh 



M 

E 



2%*,(-l) 



(A-40) 



xsm 



K 



where we have inserted back the term depending on the 
prior. We then arrive at (17). The BP algorithm is thus 
finally reduced to (A-40) and requires about 0(7V^) op- 
erations for each step. The MPM estimator at time step 
t is given by $\hat{s}^t_{il} = \mathrm{sgn}(m^t_{il})$.
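The MPM estimate is simply the sign of each full expectation, as in the following minimal sketch. The tie-breaking rule for a zero expectation and the overlap diagnostic are assumptions added for illustration; the overlap is only computable in simulations where the original message is known.

```python
def mpm_estimate(m):
    """Sign of each expectation; a zero expectation is mapped to +1 (assumed tie-break)."""
    return [1 if v >= 0 else -1 for v in m]

def overlap(s_hat, s_true):
    """Fraction-based agreement in [-1, 1]: mean of the elementwise products."""
    return sum(a * b for a, b in zip(s_hat, s_true)) / len(s_true)

m_example = [0.3, -0.7, 0.0, -0.1]
s_hat = mpm_estimate(m_example)
print(s_hat, overlap(s_hat, [1, -1, 1, 1]))
```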

In this case the BP reduces to a single recurrent equa- 
tion given by (A-40), where in the case of the PTH, we 
have 



fJil 
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where 



k^fjbl 



(A-44)

and

$$H(u) = \int_{u}^{\infty} Dx. \qquad \text{(A-46)}$$
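Assuming that $H(u)$ denotes the Gaussian tail integral $\int_u^\infty Dx$, with $Dx$ the standard Gaussian measure, it can be evaluated through the complementary error function, as in this minimal sketch (the function name is hypothetical).

```python
import math

def gaussian_tail(u):
    """H(u) = integral from u to infinity of exp(-x^2/2)/sqrt(2*pi) dx."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

print(gaussian_tail(0.0), gaussian_tail(1.0))
```

Using `erfc` rather than `1 - erf` keeps the result accurate for large positive u, where the tail is very small.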



In another schemes, we first calculate U^^^n V^^^i, Ul ^^i 
and which are needed to obtain an iterative equa- 
tion of '(A-40). 

A.2 Committee tree with non-monotonic hidden units
(CTH) 

In the case of the CTH, Tk is given by. 



K 



sgn 



sgn 



.(A-47) 



In the same way as the PTH, we find 
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Evaluating S^^^^^/(0) and — ^^^{^ ' Uf^^o^ we obtain 
y^'{l-r-p) " 
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where 



(A-53) 



and the sum is taken over all possible states of the dummy
binary variables $\{\tau_l\}$, each of which takes the values
$\pm 1$.

A.3 Committee tree with a non-monotonic output unit
(CTO) 

In this case it should be noted that optimal performance
is obtained only for an infinite number of hidden units,
$K \to \infty$. However, we decided to investigate the
performance given by the scheme even with a finite number of
hidden units. In the case of the CTO Tk is given by. 



= fk 
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In the same way as the PTH, we find 
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In the same way, one can obtain 
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Appendix B: Derivation of the BP encoder for 
lossy compression 

In the lossy compression case, $G_{k,\mu}$ is given by



B.2 Committee tree with non-monotonic hidden units 
In the case of the CTH, one can find 
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according to the reference. The method to derive the 
set of BP messages is exactly the same as in the error
correcting case. Thus, the BP equations are given by (A-4),
(A-5) and (A-16) for the standard algorithm and by
(A-40) for the more approximated version. Only 5^, [/, [/, 
V and V change. Therefore, in lossy compression case, 
we first calculate Ul^^, V^^i, Ul^i and Vl^^ for each 
scheme. 

B.l Parity tree with non-monotonic hidden units 
(PTH) 

In the case of the PTH, using the same method as the 
error correcting case, one can find ^k^ 



where $\Theta$ denotes the unit step function, which takes 1 for
$x > 0$ and 0 for $x < 0$. Using the above equation, we
have
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B.3 Committee tree with a non-monotonic output unit 
(CTO) 

In this case it should be noted that optimal performance
is obtained only for an infinite number of hidden units,
$K \to \infty$. However, we decided to investigate the
performance given by the scheme even with a finite number of
hidden units. We find the following:
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We then have 
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